In [24]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
plt.rcParams['figure.figsize'] = 9, 6
In [25]:
from sklearn.feature_selection import VarianceThreshold
In [26]:
X = np.array([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])
X
Out[26]:
array([[0, 2, 0, 3],
       [0, 1, 4, 3],
       [0, 1, 1, 3]])
In [27]:
selector = VarianceThreshold(threshold=0.0)
selector.fit_transform(X)
Out[27]:
array([[2, 0],
       [1, 4],
       [1, 1]])
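A threshold above zero removes near-constant features as well. A minimal sketch, assuming Boolean features where we want to drop any feature that takes the same value in more than 80 % of the samples (the variance of a Bernoulli variable is p(1-p)):
In [ ]:
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# column 0 is 0 in five of six samples, so its variance (~0.14)
# falls below the 0.8 * (1 - 0.8) = 0.16 threshold and gets dropped
X_bool = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
                   [0, 1, 1], [0, 1, 0], [0, 1, 1]])
sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
sel.fit_transform(X_bool)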
In [28]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2  # other scoring functions can be used as well
iris = load_iris()
X, y = iris.data, iris.target
X.shape
Out[28]:
(150, 4)
In [29]:
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape
Out[29]:
(150, 2)
SelectPercentile - keeps a given percentage of the highest-scoring features
SelectFpr - false positive rate
SelectFwe - family-wise error
GenericUnivariateSelect - all of the above in one class; the strategy is chosen via a parameter (see the sketch below)
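A minimal sketch of GenericUnivariateSelect on the iris data (the mode and param values are purely illustrative):
In [ ]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, chi2

X, y = load_iris(return_X_y=True)
# mode picks the strategy ('percentile', 'k_best', 'fpr', 'fdr', 'fwe'),
# param is its argument; here we keep the best 50 % of the features
transformer = GenericUnivariateSelect(chi2, mode='percentile', param=50)
transformer.fit_transform(X, y).shape  # (150, 2)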
Sequential Forward Selection (SFS)
Iteratively grows the feature set, each time adding the attribute that improves the score the most.
Sequential Backward Selection (SBS)
Iteratively shrinks the feature set, each time removing the attribute that contributes the least.
Sequential Floating Forward Selection (SFFS)
SFS with an extra step that tries to drop already-added attributes if they turn out not to help much.
Sequential Floating Backward Selection (SFBS)
SBS with an extra step that tries to re-add a previously removed attribute.
In [30]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
In [31]:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs1 = SFS(knn, k_features=3, forward=True, floating=False, verbose=2, scoring='accuracy', cv=0)
# this one class covers SFS, SFFS, SBS and SFBS, and can even add cross-validation
sfs1 = sfs1.fit(X, y)
In [32]:
sfs1.subsets_
Out[32]:
In [33]:
sfs1.k_feature_idx_
Out[33]:
In [34]:
sfs1.k_score_
Out[34]:
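The other three strategies come from the same class just by flipping its flags. A minimal sketch, assuming the knn estimator and data from the cells above:
In [ ]:
# flag combinations for mlxtend's SequentialFeatureSelector:
#   SFS:  forward=True,  floating=False
#   SBS:  forward=False, floating=False
#   SFFS: forward=True,  floating=True
#   SFBS: forward=False, floating=True
sbs = SFS(knn, k_features=3, forward=False, floating=False,
          scoring='accuracy', cv=5)  # cv=5 adds 5-fold cross-validation
sbs = sbs.fit(X, y)
sbs.k_feature_idx_, sbs.k_score_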
In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X, y = iris.data, iris.target
X.shape
Out[35]:
(150, 4)
In [36]:
clf = RandomForestClassifier()
clf = clf.fit(X, y)
clf.feature_importances_
Out[36]:
In [37]:
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
X_new.shape
Out[37]:
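To see which features survived, or to tighten the cut-off, SelectFromModel offers get_support() and a threshold parameter. A minimal sketch continuing from the cells above ('median' is just an illustrative choice):
In [ ]:
# indices of the features that were kept
model.get_support(indices=True)

# stricter selection: keep only features with at least median importance
strict = SelectFromModel(clf, prefit=True, threshold='median')
strict.transform(X).shape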
Consider which feature selection approach fits your case; it depends mainly on the algorithm used to build the model.
If you use a linear model or a forest, the model already exposes coefficients or feature importances, so embedded selection (e.g. SelectFromModel) comes almost for free and filters, let alone wrappers, are usually unnecessary.
If you cannot afford to train the model repeatedly, filters can be a sufficient quick fix; just consider carefully which property of the attributes you use to rank their importance.
If you can afford to run the training multiple times, the best options are probably SFFS or RFECV.
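A minimal sketch of RFECV (recursive feature elimination with cross-validation) on the same iris data, assuming a random forest as the estimator:
In [ ]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = load_iris(return_X_y=True)
# RFECV repeatedly drops the least important feature and keeps the
# subset size with the best cross-validated score
rfecv = RFECV(RandomForestClassifier(random_state=0), cv=5, scoring='accuracy')
rfecv = rfecv.fit(X, y)
rfecv.n_features_, rfecv.support_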